Abstract

We demonstrate the consistency of cross validation for comparing multiple
density estimators using simple inequalities on the likelihood ratio. In
nonparametric problems, the splitting of data does not require the domination
of test data over the training/estimation data, in contrast to the linear
regression setting of Shao (1993). The result is complementary to that of
Yang (2005) and Yang (2006).
1 Introduction



Cross validation (CV) is a common procedure used for smoothing parameter
estimation or model selection. When multiple procedures are being compared,
it is customary to choose the one that obtains the smallest "loss", which is
defined specifically for the problem at hand. If both the procedure used to
obtain the estimate and the computation of the loss are based on the same set
of data, it is well known that the estimated loss is biased due to the
double use of the same observations. In order to obtain an unbiased estimate,
one approach is to use a penalty term that takes into account the complexity
of the model. This approach includes AIC, BIC, Cp, etc. A simpler and closely
related approach, available when we have the luxury of enough observations, is
to split the data in such a way that one part is used to obtain the estimate
and the other, separate hold-out part is used to evaluate the loss. The main
advantage of this approach is that it can be easily applied in various
situations (without theoretical derivation) to select one out of many
competing procedures. There are a few variations of this approach, including
leave-one-out CV, k-fold CV, generalized CV, etc. A well-studied case is the
linear regression problem, where Shao (1993) showed the surprising result that
leave-one-out CV will select models with extra redundant variables with
nonvanishing probability. (This theory assumes that the number of covariates
is fixed; it is a different story when the number of covariates grows with n.)
In order to restore consistency, one should split the data such that the size
of the evaluation part of the data is dominating. All of the above techniques
are summarized and compared in Shao (1997) in the context of linear regression
with different kinds of asymptotics.

Yang (2005) studied cross validation in the context of nonparametric
regression, comparing a finite number of estimators. It is shown in that paper
that under the $L_2$ loss, as long as one of the competing procedures converges
at a nonparametric rate, the dominance of the evaluation data is not necessary
for consistency. Instead, the two parts of the data can be of the same order,
which is surprising considering the corresponding result for linear
regression. The proof of Yang's result is based on an application of
Bernstein's inequality. Similarly, it is shown in Yang (2006) that cross
validation is consistent in classification problems, where the consistency
also depends on the rate of disagreement between the two classifiers.

In this paper, we consider the problem of density estimation when the
observations are generated i.i.d. from the true distribution $P_0$. There
exists a large literature on density estimators. Earlier results focus on
linear estimators, including the kernel density estimator; later developments
include wavelet thresholding and adaptive-width kernels, which achieve the
minimax rate of convergence in a large class of Besov spaces where no linear
estimate can attain the optimal convergence rate. Faced with such a large
choice of estimators with different theoretical properties, it is important to
select one that has optimal performance for the current problem. Cross
validation can be directly applied by splitting the observations into two
groups. Different estimators, such as kernel estimates and wavelets, are
obtained from the first part of the data; the likelihood of the second group
of data is then evaluated and compared. Finally, the estimator that obtains
the largest likelihood is chosen as the winner. A natural question is whether
this process will return the optimal procedure. In particular, what condition
on the splitting ratio should be satisfied in order to ensure consistency?
The main conclusion of this paper is similar to that of Yang (2005): we do not
necessarily need to assume the size dominance of the evaluation data, as in
the linear regression problem.
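
The selection rule just described can be sketched in a few lines. The following
is a minimal illustration, not the procedure analyzed in any of the cited
papers: the two candidates (a normal maximum-likelihood fit and a Gaussian
kernel estimate with a fixed bandwidth of 0.3), the simulated bimodal data, the
equal split, and all function names are assumptions chosen for the example.

```python
import numpy as np


def gaussian_kde_logpdf(train, x, bandwidth):
    """Log-density at `x` of a Gaussian kernel estimate built from `train`."""
    z = (x[:, None] - train[None, :]) / bandwidth
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(bandwidth)
    # Log of the average kernel value, computed stably (log-mean-exp).
    m = log_kernel.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_kernel - m).mean(axis=1, keepdims=True))).ravel()


def normal_mle_logpdf(train, x):
    """Log-density at `x` of a normal distribution fitted to `train` by MLE."""
    mu, sigma = train.mean(), train.std()
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


def select_by_holdout_likelihood(data, n1, candidates):
    """Fit each candidate on the first n1 points and score it by the
    log-likelihood of the remaining points; return the best name and all scores."""
    train, test = data[:n1], data[n1:]
    scores = {name: float(np.sum(fit(train, test))) for name, fit in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical data: an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2),
# so the normal family is misspecified.
rng = np.random.default_rng(0)
n = 2000
left = rng.random(n) < 0.5
data = np.where(left, rng.normal(-2.0, 0.5, n), rng.normal(2.0, 0.5, n))

candidates = {
    "normal_mle": normal_mle_logpdf,
    "kde_h0.3": lambda train, test: gaussian_kde_logpdf(train, test, 0.3),
}
best, scores = select_by_holdout_likelihood(data, n1=1000, candidates=candidates)
print(best, scores)
```

Working with log-likelihoods rather than the raw product avoids numerical
underflow for large evaluation sets; the maximizer is the same.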
2 Consistency result



Consider the situation where we have n observations $X_1, X_2, \ldots, X_n$
generated i.i.d. from an underlying distribution $P_0$ with corresponding
density $p_0$. There exist many well-known density estimators. For parametric
procedures, such as mixture modeling using parametric families of densities,
the convergence rate is usually $1/\sqrt{n}$. On the other hand, for
nonparametric procedures, the rate of convergence is slower, depending on the
smoothness of the true density. Density estimation is closely related to the
regression problem, as demonstrated in a series of well-known papers (see,
e.g., Brown and Low (1996), Nussbaum (1996)).

With many possible choices for density estimation procedures, both parametric 
and nonparametric, one needs to find the best estimator among them for 
the current data. Parametric procedures have a faster rate of convergence 
when the model is correct, but suffer from a nonvanishing bias when the true 
distribution lies outside of the parametric family. Nonparametric procedures 
are more flexible but lose in efficiency when the underlying density is of a 
known parametric form. In practice, we need to know which procedure is best 
without knowledge of the true distribution. 

We start by splitting the data into two parts: the estimation data
$X^1 = \{X_1, \ldots, X_{n_1}\}$ and the evaluation data
$X^2 = \{X_{n_1+1}, \ldots, X_n\}$, and let $n_2 = n - n_1$. We assume we have
many estimation procedures $\{P_i\}_{i=1}^{m_n}$ (note the number of potential
choices can grow with n), which will produce density estimates
$\{\hat p_i^{(n_1)}(x; X_1, \ldots, X_{n_1})\}$; we will omit the dependence of
$\hat p_i^{(n_1)}$ on the training data $X^1$ in the following. To choose the
best procedure among those $m_n$, the test data $X^2$ is used to evaluate the
likelihood: $\hat p_i^{(n_1)}(X^2) = \prod_{k=n_1+1}^{n} \hat p_i^{(n_1)}(X_k)$.
If $\hat p_{\hat i}^{(n_1)}(X^2) = \max_i \hat p_i^{(n_1)}(X^2)$, then the
procedure $P_{\hat i}$ is selected as the final estimator. The desired property
is that this cross-validation procedure will select the best one with high
probability. We will use a loss function $d(p_0, p)$ to measure the closeness
of $p$ to the true density $p_0$. In this paper, we will adopt the commonly
used Hellinger distance as the loss function:
$d_H(p_0, p) = (\int (\sqrt{p_0} - \sqrt{p})^2)^{1/2}$. Another commonly used
measure of loss in the context of density estimation is the Kullback-Leibler
divergence (which is not a true distance) $d_K(p_0, p) = \int p_0 \log(p_0/p)$.
It is always true that $d_H^2 \le d_K$. Under some mild assumptions, these two
measures of loss are almost equivalent. The simplest case under which this is
true is when the class of densities considered is uniformly bounded away from
zero and infinity, so that the ratio $p_0/p$ is uniformly bounded; then
$d_K(p_0, p) = O(d_H^2(p_0, p))$ (see, e.g., Lemma 8.2 in Ghosal et al.
(2000)). The more complicated techniques similar to those used in Section 3 of
that paper can also be used when the estimate is constrained to be within a
finite approximation set (this will result in an extra logarithmic factor).
The paper of Yang and Barron (1999) contains more information regarding the
relationships between $d_H$, $d_K$ and the $L_p$ loss function, and
establishes some equivalence results between them under some conditions. In
the rest of the paper, we will assume that $d_K \le M d_H^2$ for some constant
M.
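
As a numerical illustration of the two loss functions and of the inequality
$d_H^2 \le d_K$, the sketch below evaluates both on a grid for a pair of normal
densities. The particular densities, the grid, and the function names are
assumptions made for the example; the integrals are approximated by simple
Riemann sums.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def hellinger_sq(p0, p, dx):
    """Squared Hellinger distance: integral of (sqrt(p0) - sqrt(p))^2."""
    return float(np.sum((np.sqrt(p0) - np.sqrt(p)) ** 2) * dx)


def kl_divergence(p0, p, dx):
    """Kullback-Leibler divergence: integral of p0 * log(p0 / p)."""
    return float(np.sum(p0 * np.log(p0 / p)) * dx)


# Hypothetical example: p0 = N(0, 1) and p = N(1, 1.5^2) on a fine grid.
x = np.linspace(-12.0, 12.0, 48001)
dx = x[1] - x[0]
p0 = normal_pdf(x, 0.0, 1.0)
p = normal_pdf(x, 1.0, 1.5)

h2 = hellinger_sq(p0, p, dx)
kl = kl_divergence(p0, p, dx)
print(h2, kl, h2 <= kl)
```

Note that the Hellinger distance here carries no factor of 1/2, matching the
definition in the text, so its square ranges over [0, 2].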

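Theorem 1 $P_0(\hat p_1^{(n_1)}(X^2) > \hat p_i^{(n_1)}(X^2), \forall i > 1 \mid X^1) \to 1$
if the following conditions hold:

(1) $n_1 \to \infty$, $n_2 \to \infty$.

(2) $n_2 \min_{i>1} v_{n_1,i}^2 \to \infty$, and
$\log m_n = o(n_2 \min_{i>1} v_{n_1,i}^2)$.

(3) There exist $c < 1$ and $t_{n_2} > 0$ s.t. $n_2 t_{n_2}^2 \to \infty$ and
$M v_{n_1,1}^2 + s_{n_1} t_{n_2} < c \min_{i>1} v_{n_1,i}^2$,

where $v_{n_1,i} = d_H(p_0, \hat p_i^{(n_1)})$,
$s_{n_1}^2 = V(p_0, \hat p_1^{(n_1)})$, and
$V(p_0, p) = \int p_0 (\log(p_0/p))^2$.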

Remark 2 In the statement of the theorem, the probability is conditioned on
$X^1$, and $v_{n_1,i}$ and $s_{n_1}$ in conditions (2) and (3) are random
variables, so the theorem should be interpreted as
$P_0(\hat p_1^{(n_1)}(X^2) > \hat p_i^{(n_1)}(X^2), \forall i > 1 \mid X^1) \to 1$
on the set where conditions (1)-(3) hold.

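Proof. The proof is based on simple likelihood ratio inequalities in Wong and
Shen (1995). All the probabilities below are implicitly conditioned on the
training data $X^1$. From Lemma 1 in Wong and Shen (1995), we have, for
$i > 1$, $b > 0$,

$$P_0\left(\frac{\hat p_i^{(n_1)}(X^2)}{p_0(X^2)} \ge \exp(-n_2 b)\right) \le \exp\left(-\frac{n_2 (v_{n_1,i}^2 - b)}{2}\right),$$

where $p_0(X^2) = \prod_{k=n_1+1}^{n} p_0(X_k)$. Choosing $b = c v_{n_1,i}^2$
($c$ as in assumption (3) above), we get

$$P_0\left(\frac{\hat p_i^{(n_1)}(X^2)}{p_0(X^2)} \ge \exp(-n_2 c v_{n_1,i}^2)\right) \le \exp\left(-\frac{n_2 v_{n_1,i}^2}{2}(1 - c)\right).$$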
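
For intuition, a bound of this form can be obtained from Markov's inequality
applied to the square root of the likelihood ratio. The following is a sketch
of that standard argument, not a reproduction of the proof in Wong and Shen
(1995). Here $p$ stands for a fixed density with $v = d_H(p_0, p)$, and
$X_1, \ldots, X_n$ are i.i.d. from $p_0$; these symbols are introduced only for
the sketch.

```latex
\begin{align*}
P_0\Big(\prod_{k=1}^{n}\frac{p(X_k)}{p_0(X_k)} \ge e^{-nb}\Big)
  &= P_0\Big(\prod_{k=1}^{n}\sqrt{\frac{p(X_k)}{p_0(X_k)}} \ge e^{-nb/2}\Big) \\
  &\le e^{nb/2}\Big(E_0\sqrt{\frac{p(X_1)}{p_0(X_1)}}\Big)^{n}
   = e^{nb/2}\Big(\int\sqrt{p\,p_0}\Big)^{n} \\
  &= e^{nb/2}\Big(1-\tfrac{1}{2}v^{2}\Big)^{n}
   \le \exp\Big(-\frac{n\,(v^{2}-b)}{2}\Big).
\end{align*}
```

The third line uses $\int (\sqrt{p_0} - \sqrt{p})^2 = 2 - 2\int \sqrt{p\,p_0}$,
and the last step uses $1 - x \le e^{-x}$.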

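Denote by $W_{n_2}$ the event

$$\left\{ \frac{\hat d_K^{(n_2)}(p_0, \hat p_1^{(n_1)}) - d_K(p_0, \hat p_1^{(n_1)})}{s_{n_1}} > t_{n_2} \right\},$$

where $\hat d_K^{(n_2)}(p_0, \hat p_1^{(n_1)})$ is the empirical version of
$d_K(p_0, \hat p_1^{(n_1)})$ on the evaluation data $X^2$:

$$\hat d_K^{(n_2)}(p_0, \hat p_1^{(n_1)}) = \frac{1}{n_2} \sum_{i=n_1+1}^{n} \log \frac{p_0(X_i)}{\hat p_1^{(n_1)}(X_i)}.$$

By Chebyshev's inequality, $P_0(W_{n_2}) \le 1/(n_2 t_{n_2}^2) \to 0$. Denoting
$d = M v_{n_1,1}^2 + s_{n_1} t_{n_2}$,

$$P_0\left(\frac{\hat p_1^{(n_1)}(X^2)}{p_0(X^2)} < e^{-n_2 d}\right) = P_0\left(\hat d_K^{(n_2)}(p_0, \hat p_1^{(n_1)}) > d\right) \le P_0(W_{n_2}).$$

The last inequality above holds since on the set $W_{n_2}^c$ we have

$$\hat d_K^{(n_2)}(p_0, \hat p_1^{(n_1)}) \le d_K(p_0, \hat p_1^{(n_1)}) + s_{n_1} t_{n_2} \le M v_{n_1,1}^2 + s_{n_1} t_{n_2} = d.$$

Note that when $d = M v_{n_1,1}^2 + s_{n_1} t_{n_2} < c v_{n_1,i}^2$
(assumption (3)), $\hat p_1^{(n_1)}(X^2)/p_0(X^2) \ge e^{-n_2 d}$ and
$\hat p_i^{(n_1)}(X^2)/p_0(X^2) < e^{-n_2 c v_{n_1,i}^2}$ imply
$\hat p_1^{(n_1)}(X^2) > \hat p_i^{(n_1)}(X^2)$. So we can bound the
probability that the cross validation procedure chooses an estimator other
than $P_1$:

$$P_0\left(\hat p_1^{(n_1)}(X^2) \le \hat p_i^{(n_1)}(X^2) \text{ for some } i > 1\right)$$

$$\le P_0\left(\frac{\hat p_1^{(n_1)}(X^2)}{p_0(X^2)} < e^{-n_2 d}\right) + \sum_{i>1} P_0\left(\frac{\hat p_i^{(n_1)}(X^2)}{p_0(X^2)} \ge e^{-n_2 c v_{n_1,i}^2}\right)$$

$$\le P_0(W_{n_2}) + m_n \exp\left(-\frac{n_2 (1 - c)}{2} \min_{i>1} v_{n_1,i}^2\right).$$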

The above expression converges to zero under the assumed condition (2). □ 



Remark 3 The above theorem cannot be directly applied since $v_{n_1,i}$ are
random variables depending on $X^1$. We will specialize the result to the two
procedures case below in Corollary 7.

Remark 4 Under some mild conditions, we will have
$s_{n_1}^2 = V(p_0, \hat p_1^{(n_1)}) = O(d_H^2(p_0, \hat p_1^{(n_1)}))$; see,
e.g., Theorem 5 in Wong and Shen (1995).






Remark 5 As in Yang (2005), we can consider the case where we have multiple
different splittings of the original data. Cross validation as stated above
can be applied to each splitting separately, and a majority vote can then be
used to choose the final procedure.
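
The voting scheme over multiple splittings can be sketched as follows. This is
an illustration under assumptions made for the example, not the scheme exactly
as specified in Yang (2005): the splittings are random permutations, the two
candidates are a normal fit and a Laplace fit, the data are simulated from a
standard normal distribution, and all names are hypothetical.

```python
from collections import Counter

import numpy as np


def normal_logpdf(train, x):
    mu, sigma = train.mean(), train.std()
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


def laplace_logpdf(train, x):
    # Laplace MLE: location is the median, scale is the mean absolute deviation.
    loc = np.median(train)
    scale = np.mean(np.abs(train - loc))
    return -np.log(2 * scale) - np.abs(x - loc) / scale


def holdout_winner(train, test, candidates):
    """Index of the candidate with the largest held-out log-likelihood."""
    scores = [float(np.sum(logpdf(train, test))) for logpdf in candidates]
    return int(np.argmax(scores))


def majority_vote_cv(data, n1, candidates, n_splits, rng):
    """Run hold-out selection on several random splittings and take a vote."""
    votes = Counter()
    for _ in range(n_splits):
        perm = rng.permutation(len(data))
        train, test = data[perm[:n1]], data[perm[n1:]]
        votes[holdout_winner(train, test, candidates)] += 1
    return votes.most_common(1)[0][0], votes


rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 2000)  # hypothetical data; the normal family is correct
winner, votes = majority_vote_cv(
    data, n1=1000, candidates=[normal_logpdf, laplace_logpdf], n_splits=11, rng=rng
)
print(winner, dict(votes))
```

An odd number of splittings avoids ties when only two procedures are compared.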

Now we give a definition comparing two procedures by their rate of
convergence.

Definition 6 Procedure $P_1$ is asymptotically better than $P_2$ under the
loss function $d_H$ if for some sequence $\epsilon_n \to 0$,

$$\lim_{n \to \infty} P_0\left(d_H(p_0, \hat p_1^{(n)}) < \epsilon_n d_H(p_0, \hat p_2^{(n)})\right) = 1.$$



Under this definition we can state the following corollary. The definition of
exact rate of convergence is similar to Definition 3 in Yang (2005).

Corollary 7 Consider two procedures for density estimation where one is
asymptotically better than the other. Suppose the exact rates of convergence
of $d_H(p_0, \hat p_1^{(n)})$ and $d_H(p_0, \hat p_2^{(n)})$ are $p_n$ and
$q_n$ respectively. Assume that
$V(p_0, \hat p_i^{(n_1)}) = O(d_K(p_0, \hat p_i^{(n_1)}))$, $i = 1, 2$. If
$n_1 \to \infty$, $n_2 \to \infty$, and
$\sqrt{n_2} \max(p_{n_1}, q_{n_1}) \to \infty$, then cross validation is
consistent in the sense of choosing the asymptotically better procedure with
probability tending to 1.

Remark 8 If one of the procedures has nonparametric rate $n^{-\alpha}$ with
$\alpha < 1/2$, then $n_1/n_2 = O(1)$ will suffice for
$\sqrt{n_2} \max(p_{n_1}, q_{n_1}) \to \infty$.
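
The arithmetic behind this remark can be checked numerically. The sketch below
assumes an equal split $n_1 = n_2 = n/2$ and treats the slower of the two rates
as exactly $n_1^{-\alpha}$; the exponent 0.4 is a hypothetical nonparametric
rate chosen for illustration, and 0.5 is the parametric rate for contrast.

```python
import math


def growth_term(n1, n2, alpha):
    """sqrt(n2) * n1^(-alpha): the quantity required to diverge in the corollary,
    with the slower rate taken to be n1^(-alpha)."""
    return math.sqrt(n2) * n1 ** (-alpha)


sizes = [10**k for k in range(2, 7)]
# Equal split: both halves are of the same order, so n1 / n2 = O(1).
nonparametric = [growth_term(n // 2, n // 2, 0.4) for n in sizes]
parametric = [growth_term(n // 2, n // 2, 0.5) for n in sizes]

for n, a, b in zip(sizes, nonparametric, parametric):
    print(n, round(a, 3), round(b, 3))
```

With $\alpha = 0.4$ the term equals $(n/2)^{0.1}$ and grows without bound,
whereas with $\alpha = 1/2$ it stays at 1 under an equal split, so the
condition can then only be met by letting $n_2$ dominate $n_1$.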



3 Conclusions 



We give a simple proof of the consistency of cross validation in the context
of density estimation. Although it is shown in Shao (1993) that leave-one-out
cross validation is inconsistent for the linear regression problem, it is
unclear to us whether this is the case for nonparametric problems. Another
interesting problem arises when multiple splittings are available: we can
either use majority voting as in Yang (2005) or choose the procedure with the
largest product of the individual likelihoods over the splittings. The
comparison of these two approaches is similar to the tradeoff between model
selection and model averaging.